5.1 Exploring data

5.1.1 Count number of rows

An early step in any data processing is to check how many rows a data frame contains, as this often represents the number of participants or the total number of trials. It is worth rechecking at multiple steps of your data processing to make sure you have not unintentionally dropped or duplicated rows.

Code
library(readr)
library(dplyr)

# demographics data
data_demographics_raw <- read_csv(file = "../data/raw/data_demographics_raw.csv") 

# self report measure data
data_selfreport_raw <- read_csv(file = "../data/raw/data_selfreport_raw.csv") 

# affect misattribution procedure (AMP) data
data_amp_raw <- read_csv(file = "../data/raw/data_amp_raw.csv")

nrow(data_demographics_raw)
[1] 200
Code
nrow(data_selfreport_raw)
[1] 392
Code
nrow(data_amp_raw)
[1] 8224
  • Why are there different numbers of rows in the three data frames when the data all come from the same participants?
  • Why are the row counts not round numbers?
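
One way to start investigating these questions is to compare the number of rows with the number of unique participants. The sketch below uses a small made-up data frame (toy_data and its values are hypothetical, not the course data) and assumes that participants are identified by a subject column, as in the raw data above.

```r
library(dplyr)

# hypothetical toy data: three participants contributing 2, 1 and 3 rows
toy_data <- tibble(
  subject  = c(101, 101, 102, 103, 103, 103),
  trialnum = c(1, 2, 1, 1, 2, 3)
)

# total number of rows
n_rows <- nrow(toy_data)

# number of unique participants
n_participants <- n_distinct(toy_data$subject)

# number of rows contributed by each participant
rows_per_participant <- toy_data %>%
  count(subject, name = "n_rows")

n_rows
n_participants
rows_per_participant
```

If the number of rows is larger than the number of unique participants, each participant contributes more than one row, and the rows-per-participant table shows whether everyone contributes the same number.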

5.1.2 Viewing column names

How would you know what variables are in a data frame? You can view the data frame, but it can also be useful to print its column names. Knowing what you have is one of the first steps to working with it.

Code
# print all column names
colnames(data_demographics_raw)
 [1] "date"           "time"           "group"          "subject"       
 [5] "session"        "build"          "blocknum"       "trialnum"      
 [9] "blockcode"      "trialcode"      "pretrialpause"  "posttrialpause"
[13] "trialduration"  "trialtimeout"   "response"       "correct"       
[17] "latency"       
Code
# print all column names as R code for a character vector (handy for copy-pasting)
dput(colnames(data_demographics_raw))
c("date", "time", "group", "subject", "session", "build", "blocknum", 
"trialnum", "blockcode", "trialcode", "pretrialpause", "posttrialpause", 
"trialduration", "trialtimeout", "response", "correct", "latency"
)
Code
data_demographics_raw %>%
  colnames() %>%
  dput()
c("date", "time", "group", "subject", "session", "build", "blocknum", 
"trialnum", "blockcode", "trialcode", "pretrialpause", "posttrialpause", 
"trialduration", "trialtimeout", "response", "correct", "latency"
)
Code
data_selfreport_raw %>%
  colnames() %>%
  dput()
c("date", "time", "group", "subject", "session", "build", "blocknum", 
"trialnum", "blockcode", "trialcode", "pretrialpause", "posttrialpause", 
"trialduration", "trialtimeout", "response", "correct", "latency"
)
Code
data_amp_raw %>%
  colnames() %>%
  dput()
c("date", "time", "subject", "blockcode", "Blocknum and trialnum", 
"trialcode", "primestim", "targetstim", "correct", "latency")

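Once the column names are available as vectors, they can be compared between data frames. The following is a minimal sketch using base R's setdiff() and intersect(), with the column names copied from the output above. The object names cols_demographics and cols_amp are made up for this illustration.

```r
# column names copied from the dput() output above
cols_demographics <- c("date", "time", "group", "subject", "session", "build",
                       "blocknum", "trialnum", "blockcode", "trialcode",
                       "pretrialpause", "posttrialpause", "trialduration",
                       "trialtimeout", "response", "correct", "latency")

cols_amp <- c("date", "time", "subject", "blockcode", "Blocknum and trialnum",
              "trialcode", "primestim", "targetstim", "correct", "latency")

# columns that appear in the AMP data but not in the demographics data
only_in_amp <- setdiff(cols_amp, cols_demographics)

# columns that appear in both data frames
in_both <- intersect(cols_amp, cols_demographics)

only_in_amp
in_both
```

Here only_in_amp contains "Blocknum and trialnum", "primestim" and "targetstim". The first of these contains spaces and a capital letter, which is one reason to clean column names before going further.
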
5.1.3 Viewing column names and types

Code
head(data_demographics_raw) 
# A tibble: 6 × 17
  date       time        group subject session build blocknum trialnum blockcode
  <date>     <time>      <dbl>   <dbl>   <dbl> <chr>    <dbl>    <dbl> <chr>    
1 2022-06-23 10:46:30   8.66e8  5.49e8       1 6.6.0        1        2 demograp…
2 2022-06-23 10:46:30   8.66e8  5.49e8       1 6.6.0        1        3 demograp…
3 2022-06-23 11:54:55   6.31e8  5.05e8       1 6.6.0        1        2 demograp…
4 2022-06-23 11:54:55   6.31e8  5.05e8       1 6.6.0        1        3 demograp…
5 2022-06-23 12:23:32   5.69e8  9.95e8       1 6.6.0        1        2 demograp…
6 2022-06-23 12:23:32   5.69e8  9.95e8       1 6.6.0        1        3 demograp…
# ℹ 8 more variables: trialcode <chr>, pretrialpause <dbl>,
#   posttrialpause <dbl>, trialduration <dbl>, trialtimeout <dbl>,
#   response <chr>, correct <dbl>, latency <dbl>
Code
head(data_selfreport_raw)
# A tibble: 6 × 17
  date     time         group  subject session build blocknum trialnum blockcode
  <chr>    <time>       <dbl>    <dbl>   <dbl> <chr>    <dbl>    <dbl> <chr>    
1 23.06.22 12:37:34 762566308   8.93e8       1 06.0…        1        1 scale    
2 23.06.22 12:37:34 762566308   8.93e8       1 06.0…        1        2 scale    
3 23.06.22 12:26:48 569179372   9.95e8       1 06.0…        1        1 scale    
4 23.06.22 12:26:48 569179372   9.95e8       1 06.0…        1        2 scale    
5 23.06.22 12:26:48 569179372   9.95e8       1 06.0…        1        3 scale    
6 23.06.22 12:26:48 569179372   9.95e8       1 06.0…        1        4 scale    
# ℹ 8 more variables: trialcode <chr>, pretrialpause <dbl>,
#   posttrialpause <dbl>, trialduration <dbl>, trialtimeout <dbl>,
#   response <chr>, correct <dbl>, latency <dbl>
Code
head(data_amp_raw)
# A tibble: 6 × 10
  date     time     subject blockcode Blocknum and trialnu…¹ trialcode primestim
  <chr>    <time>     <dbl> <chr>     <chr>                  <chr>         <dbl>
1 23.06.22 10:46:38  5.49e8 practice  1_4                    prime_ne…         0
2 23.06.22 10:46:38  5.49e8 practice  1_5                    prime_ne…         0
3 23.06.22 10:46:38  5.49e8 practice  1_6                    prime_po…         0
4 23.06.22 10:46:38  5.49e8 test      2_1                    instruct…         0
5 23.06.22 11:55:36  5.05e8 practice  1_4                    prime_ne…         0
6 23.06.22 11:55:36  5.05e8 practice  1_5                    prime_po…         0
# ℹ abbreviated name: ¹​`Blocknum and trialnum`
# ℹ 3 more variables: targetstim <dbl>, correct <dbl>, latency <dbl>
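
head() truncates wide data frames, so some columns are only listed in the footer. As a possible alternative, dplyr::glimpse() prints one line per column with its type. The sketch below uses a small made-up data frame (toy_data and its values are hypothetical) and also extracts the column types as a named vector with base R.

```r
library(dplyr)

# hypothetical toy data with one character column and two numeric columns
toy_data <- tibble(
  date    = c("23.06.22", "23.06.22"),
  subject = c(1, 2),
  latency = c(512, 430)
)

# print every column with its type and first few values
glimpse(toy_data)

# extract the (first) class of each column as a named character vector
column_types <- vapply(toy_data, function(column) class(column)[1], character(1))
column_types
```

Checking types early is useful because, as the head() output above shows, the same variable can be read in differently across files: date is a <date> in the demographics data but a <chr> in the other two data frames.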

5.2 The pipe (%>% or |>)

%>% is the original pipe, from the {magrittr} package, and is used throughout the tidyverse packages. It is slightly slower but also more flexible.

|> is the native pipe, added to base R in version 4.1.0. It is slightly faster but less flexible.

If you’re not sure, it’s easier to use %>%.
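
For simple chains the two pipes behave the same way. The sketch below illustrates this with a made-up numeric vector (values and the result names are hypothetical), and then shows one example of the extra flexibility of %>%: its `.` placeholder. It assumes R 4.1.0 or later, which the native pipe requires.

```r
library(magrittr)

values <- c(1, 4, 9)

# magrittr pipe: the left-hand side becomes the first argument of sqrt(), then of sum()
result_magrittr <- values %>% sqrt() %>% sum()

# native pipe (requires R >= 4.1.0): same result for a simple chain
result_native <- values |> sqrt() |> sum()

# the equivalent nested call, without any pipe
result_nested <- sum(sqrt(values))

# with %>%, the `.` placeholder sends the left-hand side
# to an argument other than the first
label <- "participants" %>% paste("number of", .)

result_magrittr
result_native
result_nested
label
```

All three results are identical, and label is "number of participants" because the left-hand side was passed as the second argument to paste().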

5.2.1 What is the pipe?

The pipe takes the output of the expression on its left and passes it as input to the function on its right, usually as the first argument or the data argument.

Code
library(janitor)

# use a function without the pipe
example_without_pipe <- janitor::clean_names(data_demographics_raw)

# use a function with the pipe
example_with_pipe <- data_demographics_raw %>%
  janitor::clean_names()

# check they produce identical results
identical(example_without_pipe, example_with_pipe)
[1] TRUE

5.2.2 Why use the pipe?

The pipe allows us to write code that reads from top to bottom, following a series of steps, in the way that humans organize and describe steps. Without the pipe, nested function calls have to be read from the inside out, which matches how R evaluates them but is harder for humans to follow.

The utility of this becomes more obvious when there are many steps:

Code
# use a series of functions without the pipe
example2_without_pipe <- summarise(group_by(mutate(rename(clean_names(dat = data_amp_raw), unique_id = subject, block = blockcode, trial_type = trialcode, rt = latency), fast_trial = ifelse(rt < 100, 1, 0)), unique_id), percent_fast_trials = mean(fast_trial)*100) 

# use a series of functions with the pipe
example2_with_pipe <- data_amp_raw %>%
  # clean the column names
  clean_names() %>%
  # rename the columns
  rename(unique_id = subject,
         block = blockcode,
         trial_type = trialcode,
         rt = latency) %>%
  # create a new variable using existing ones
  mutate(fast_trial = ifelse(rt < 100, 1, 0)) %>%
  # summarize across trials for each participant
  group_by(unique_id) %>%
  summarise(percent_fast_trials = mean(fast_trial)*100) 

# check they produce identical results
identical(example2_without_pipe, example2_with_pipe)
[1] TRUE

5.3 Using the pipe & cleaning column names

It is almost always useful to start by converting all column names to ones that play nice with R/tidyverse and which use the same naming convention (e.g., snake_case, which is standard in tidyverse).

How would you bring up the help page to understand how janitor::clean_names() works?

Use the pipe to clean the column names of each of the three data frames.

Code
data_demographics_clean_names <- data_demographics_raw %>%
  clean_names() 

data_selfreport_clean_names <- data_selfreport_raw %>%
  clean_names() 

data_amp_clean_names <- data_amp_raw %>%
  clean_names()
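
To see what clean_names() actually changes, it can help to compare column names before and after. The sketch below uses a small made-up data frame (toy_amp and its values are hypothetical) that borrows the awkward "Blocknum and trialnum" column name from the AMP data.

```r
library(dplyr)
library(janitor)

# hypothetical toy data with one column name that contains spaces and a capital letter
toy_amp <- tibble(
  subject                 = c(1, 2),
  `Blocknum and trialnum` = c("1_4", "1_5"),
  primestim               = c(0, 0)
)

# column names before cleaning
names_before <- colnames(toy_amp)

# clean the column names using the pipe
toy_amp_clean <- toy_amp %>%
  clean_names()

# column names after cleaning
names_after <- colnames(toy_amp_clean)

names_before
names_after
```

The name "Blocknum and trialnum" becomes "blocknum_and_trialnum", which can be typed without backticks, while names that were already in snake_case are left unchanged.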